88 research outputs found

    Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms

    We present a technical survey of state-of-the-art approaches to data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview of lower bounding techniques.

    Probabilistic Smallest Enclosing Ball in High Dimensions via Subgradient Sampling

    We study a variant of the median problem for a collection of point sets in high dimensions. This generalizes the geometric median as well as the (probabilistic) smallest enclosing ball (pSEB) problems. Our main objective and motivation is to improve the previously best algorithm for the pSEB problem by reducing its exponential dependence on the dimension to linear. This is achieved via a novel combination of sampling techniques for clustering problems in metric spaces with the framework of stochastic subgradient descent. As a result, the algorithm becomes applicable to shape fitting problems in Hilbert spaces of unbounded dimension via kernel functions. We present an exemplary application by extending the support vector data description (SVDD) shape fitting method to the probabilistic case. This is done by simulating the pSEB algorithm implicitly in the feature space induced by the kernel function.
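
As a rough illustration of combining sampling with stochastic subgradient descent, the sketch below handles only the plain geometric median, the simplest special case named in the abstract. It is not the authors' pSEB algorithm. The function name `geometric_median_sgd`, the step size 1/sqrt(t), and the averaging of iterates are assumptions made for this example.

```python
import math
import random


def geometric_median_sgd(points, iters=5000, seed=0):
    """Approximate the geometric median by stochastic subgradient descent.

    Hypothetical helper: each step samples one point p and moves the current
    center c against the subgradient (c - p) / ||c - p|| of the distance ||c - p||.
    """
    rng = random.Random(seed)
    d = len(points[0])
    c = list(points[0])      # start at an arbitrary input point
    avg = [0.0] * d          # running average of the iterates
    for t in range(1, iters + 1):
        p = points[rng.randrange(len(points))]
        diff = [ci - pi for ci, pi in zip(c, p)]
        norm = math.sqrt(sum(x * x for x in diff))
        if norm > 0.0:       # at c == p, zero is a valid subgradient
            step = 1.0 / math.sqrt(t)
            c = [ci - step * x / norm for ci, x in zip(c, diff)]
        avg = [a + (ci - a) / t for a, ci in zip(avg, c)]
    return avg


# Four symmetric points; their geometric median is the origin.
pts = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
center = geometric_median_sgd(pts)
print(center)
```

Averaging the iterates is a common way to damp the noise of single-sample steps; the last iterate alone would fluctuate more around the optimum.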

    Random projections for Bayesian regression

    This article deals with random projections applied as a data reduction technique for Bayesian regression analysis. We show sufficient conditions under which the entire $d$-dimensional distribution is approximately preserved under random projections by reducing the number of data points from $n$ to $k\in O(\operatorname{poly}(d/\varepsilon))$ in the case $n\gg d$. Under mild assumptions, we prove that evaluating a Gaussian likelihood function based on the projected data instead of the original data yields a $(1+O(\varepsilon))$-approximation in terms of the $\ell_2$ Wasserstein distance. Our main result shows that the posterior distribution of Bayesian linear regression is approximated up to a small error depending on only an $\varepsilon$-fraction of its defining parameters. This holds when using arbitrary Gaussian priors or the degenerate case of uniform distributions over $\mathbb{R}^d$ for $\beta$. Our empirical evaluations involve different simulated settings of Bayesian linear regression. Our experiments underline that the proposed method is able to recover the regression model up to small error while considerably reducing the total running time.
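
The idea of reducing $n$ data points to $k$ projected rows can be sketched with a dense Gaussian projection followed by ordinary least squares. This is a minimal illustration under assumptions, not the article's construction. The helper name `sketched_least_squares`, the Gaussian sketch, and the sizes `n`, `d`, `k` below are chosen for the example only, and the comparison covers just the least-squares point estimate, not the full posterior.

```python
import numpy as np


def sketched_least_squares(X, y, k, seed=0):
    """Solve least squares on a k-row Gaussian random projection of (X, y).

    Hypothetical helper: S has i.i.d. N(0, 1/k) entries, so S @ X has k rows
    instead of n, and the regression is solved on the reduced data.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = rng.standard_normal((k, n)) / np.sqrt(k)
    beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta


# Simulated data (assumed sizes): n = 5000 points, d = 5 features, k = 400 rows.
rng = np.random.default_rng(1)
n, d, k = 5000, 5, 400
X = rng.standard_normal((n, d))
beta_true = np.arange(1.0, d + 1.0)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch = sketched_least_squares(X, y, k)

# Relative deviation of the sketched estimate from the full-data estimate.
rel_err = np.linalg.norm(beta_sketch - beta_full) / np.linalg.norm(beta_full)
print(rel_err)
```

A dense Gaussian matrix is the simplest choice to write down; faster structured or sparse projections exist, but they are left out here to keep the example short.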

    On large-scale probabilistic and statistical data analysis

    In this manuscript we develop and apply modern algorithmic data reduction techniques to tackle scalability issues and enable statistical data analysis of massive data sets. Our algorithms follow a general scheme, where a reduction technique is applied to the large-scale data to obtain a small summary of sublinear size to which a classical algorithm is applied. The techniques for obtaining these summaries depend on the problem that we want to solve. The size of the summaries is usually parametrized by an approximation parameter, expressing the trade-off between efficiency and accuracy. In some cases the data can be reduced to a size that has no or only negligible dependency on the initial number of data items. However, for other problems it turns out that sublinear summaries do not exist in the worst case. In such situations, we exploit statistical or geometric relaxations to obtain useful sublinear summaries under certain mildness assumptions. We present, in particular, the data reduction methods called coresets and subspace embeddings, and several algorithmic techniques to construct these via random projections and sampling.

    Glow Discharge Optical Emission Spectrometry (GDOES), an Effectiveness Method for Characterizing Composition of Surfaces and Coatings

    This work presents the technical procedures and practical advantages of using Glow Discharge Optical Emission Spectroscopy (GDOES) for establishing depth concentration profiles of surfaces. GDOES can detect low concentrations with high accuracy. It can be used for either quantitative bulk analysis (QBA) or quantitative depth profiling (QDP) in the nanometer to micron range. Both non-conductive and conductive samples can be analysed. The main applications of this spectral method lie in technology fields such as heat treatment processes, casting, heat and cold forming processes, thermochemical treatments, electro-chemical processes (galvanic coatings), chemical and physical vapour depositions (CVD, PVD), thermal oxidation processes and anodizing, thin films, and others.

    Optimal Sketching Bounds for Sparse Linear Regression

    We study oblivious sketching for $k$-sparse linear regression under various loss functions such as an $\ell_p$ norm, or from a broad class of hinge-like loss functions, which includes the logistic and ReLU losses. We show that for sparse $\ell_2$ norm regression, there is a distribution over oblivious sketches with $\Theta(k\log(d/k)/\varepsilon^2)$ rows, which is tight up to a constant factor. This extends to $\ell_p$ loss with an additional additive $O(k\log(k/\varepsilon)/\varepsilon^2)$ term in the upper bound. This establishes a surprising separation from the related sparse recovery problem, which is an important special case of sparse regression. For this problem, under the $\ell_2$ norm, we observe an upper bound of $O(k\log(d)/\varepsilon + k\log(k/\varepsilon)/\varepsilon^2)$ rows, showing that sparse recovery is strictly easier to sketch than sparse regression. For sparse regression under hinge-like loss functions including sparse logistic and sparse ReLU regression, we give the first known sketching bounds that achieve $o(d)$ rows showing that $O(\mu^2 k\log(\mu n d/\varepsilon)/\varepsilon^2)$ rows suffice, where $\mu$ is a natural complexity parameter needed to obtain relative error bounds for these loss functions. We again show that this dimension is tight, up to lower order terms and the dependence on $\mu$. Finally, we show that similar sketching bounds can be achieved for LASSO regression, a popular convex relaxation of sparse regression, where one aims to minimize $\|Ax-b\|_2^2+\lambda\|x\|_1$ over $x\in\mathbb{R}^d$. We show that sketching dimension $O(\log(d)/(\lambda\varepsilon)^2)$ suffices and that the dependence on $d$ and $\lambda$ is tight. Comment: AISTATS 202

    Cold War spy satellite images reveal long-term declines of a philopatric keystone species in response to cropland expansion

    Agricultural expansion drives biodiversity loss globally, but impact assessments are biased towards recent time periods. This can lead to a gross underestimation of species declines in response to habitat loss, especially when species declines are gradual and occur over long time periods. Using Cold War spy satellite images (Corona), we show that a grassland keystone species, the bobak marmot (Marmota bobak), continues to respond to agricultural expansion that happened more than 50 years ago. Although burrow densities of the bobak marmot today are highest in croplands, densities declined most strongly in areas that were persistently used as croplands since the 1960s. This response to historical agricultural conversion spans roughly eight marmot generations and suggests the longest recorded response of a mammal species to agricultural expansion. We also found evidence for remarkable philopatry: nearly half of all burrows retained their exact location since the 1960s, and this was most pronounced in grasslands. Our results stress the need for farsighted decisions, because contemporary land management will affect biodiversity decades into the future. Finally, our work pioneers the use of Corona historical Cold War spy satellite imagery for ecology. This vastly underused global remote sensing resource provides a unique opportunity to expand the time horizon of broad-scale ecological studies.